Personalized Pangenome References

Pangenomes, by including genetic diversity, should reduce reference bias by better representing new samples compared to them. Yet when comparing a new sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with using allele frequency filters. However, this is a blunt heuristic that both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach, inspired by local ancestry inference methods, that imputes a personalized pangenome subgraph based on sampling local haplotypes according to k-mer counts in the reads. Our approach is tailored for the Giraffe short read aligner, as the indexes it needs for read mapping can be built quickly. We compare the accuracy of our approach to state-of-the-art methods using graphs from the Human Pangenome Reference Consortium. The resulting personalized pangenome pipelines provide faster pangenome read mapping than comparable pipelines that use a linear reference, reduce small variant genotyping errors by 4x relative to the Genome Analysis Toolkit (GATK) best-practice pipeline, and for the first time make short-read structural variant genotyping competitive with long-read discovery methods.

1 Data sources

Reads
We used Illumina Novaseq, and Element sequencing data from the following links: • Novaseq wgs pcr-free data: gs://brain-genomics-public/research/sequencing/fastq/ novaseq/wgs_pcr_free For benchmarking structural variants, we used Genome in a Bottle (GIAB) Tier1 v0.6 confident regions lifted to GRCh38 annotation as in [6], Challenging Medically Relevent Genes (CMRG) v1.0 and a whole genome BED file from [5] where centromeric regions were removed.We searched for good haplotype sampling parameters by iterating an early version of the variant calling pipeline with 30x NovaSeq reads for HG002.The graph we used was an unreleased Minigraph-Cactus HPRC graph with some but not all of the improvements in the v1.1 graph.We evaluated the results by comparing the F1 scores separately for indels and SNPs, with the frequencyfiltered graph as the baseline.See Supplementary Table 1 for the progression of the F1 scores.Unless otherwise mentioned, all graphs with sampled haplotypes also include reference paths.First we tried various combinations of k-mer scoring parameters while sampling 8 haplotypes.After three rounds and 68 combinations (see Supplementary Table 2), we selected 0.9 as the multiplicative discount for homozygous k-mers, 0.05 as the additive adjustment for heterozygous k-mers,
Next we locked the k-mer scoring parameters and tried sampling n ∈ {2, 4, 6, 8, 10, 12, 14, 16} haplotypes.We found that n = 4 was the best choice for indels and n = 6 for SNPs.Again, the differences between the F1 scores of the winning values were marginal.
Then we validated the parameter choices by sampling n ∈ {4, 6, 8} haplotypes for HG003, HG004, and HG005 and compared the results to the frequency-filtered graph.We used 30x NovaSeq reads for HG003 and HG004 and 20x, 30x, and 50x reads for HG005.In all cases, sampling either 4 or 6 haplotypes was the best choice.
Finally we tried diploid sampling from n ∈ {4, 8, 16, 32} candidates for HG002.Sampling from 32 candidates was the best choice for indels, while n = 16 and n = 32 were equally good for SNPs.We also tried sampling from 32 candidates without including reference paths, but the variant calling performance was worse than with reference paths.As seen in the command, we used deepvariant:1.5.0 for our analysis.This step generates .vcf.gz file that is ready for evaluation.

Evaluation using Hap.py
We used hap.py:v0.3.12 for our evaluation analysis using the following command:

Genotyping structural variants
For genotyping structural variants (SVs), we used two tools vg call v1.49.0 [3] and PanGenie v2.5.0 [1] not only to compare the two but also to validate the robustness of the various rounds of evaluation for the different graphs and call sets we tested for.The main distinction between the two lies in the model adopted to detect SVs.In fact, vg call leverages information from reads alignment performed with Giraffe [7] for then augmenting the output file and, finally, calling SVs according to the reads' coverage derived from Giraffe; a more in-depth explanation of the entire process is available in the respective papers [7,3].PanGenie adopts an HMM to infer the best possible combination of haplotype paths within the pangenome graph imputed from the short reads' dataset for the sample in question.The resulting genome inference, in VCF format, has to be converted into biallelic sites for the subsequent evaluation to work best; more details on how to run PanGenie can be found on the GitHub page for the tool (PanGenie git) as well as in the corresponding paper [1].
In both cases, the output VCF for the HG002 sample was later evaluated with truvari v3.5.0 [2] exact parametrization, as in [6], optimized for benchmarking using pangenome graphs.This was done consistently across all series of experiments with pangenomes and personalized graphs, truth sets from the GIAB and the T2T consortium and with or without BED regions.

Command line
These are the commands to generate viable VCF files with vg call and PanGenie, respectively.Next, truvari was used for SVs evaluation.

vg call
After reads mapping with vg giraffe, the vg call workflow includes two additional steps:  vcfbub -l 0 -a 100000 -i pangenie_ready.vcf.gz| vcfwave -I 10000 -t <n_of_threads> -n > decomposed_and_filtered.vcf → However, to ensure the former steps can be executed properly it is necessary to remove haploid sequences form the graph VCF (e.g.CHM13) and artificially make each sample a diploid, phased field in the VCF file; this second operation is not needed when using the v1.1 diploid graph.awk -F '\t' '/^#/ {print;next;} {OFS="\t"; print $0, $10 "\t" $11 "\t" $12 "\t" $13}' HG002_pangenie-spl.vcf> HG002_pangenie-dup.vcfAfterwards, PanGenie can be run in pipeline mode, passing through Snakemake, so that the resulting VCF file has a structure compatible with the script for biallelic conversion.This entails having a working CONDA version (Anaconda Software Distribution.Computer software.Vers.3-23.3.1.Anaconda, Mar.2023.Web.) with Mamba installed (version used 1.2.0) which is mandatory to run Snakemake (version used 7.30.1).Both PanGenie and Snakemake have to be placed in a separate CONDA environment and upon installing and building the tool, at this location path/to/pangenie/pipelines/run-from-callset, it will be present a config.yamlfile to be filled in with relevant information.Once done, PanGenie can be run after activating the Snakemake environment and invoking the following from the folder where the config.yaml is saved: In general, the config.yamlfile will ask for a graph VCF generated as previously described, a reference file (GRCh38 in this case), the reads dataset for the sample in question (HG002 in this set of analyses), the paths to the binaries for the tool and an output directory.Chromosomes' name must be identical both in the reference and in the graph VCF for the genome inference to be performed correctly; it is good practice to leave the reference untouched and change the name of chromosomes in the VCF file with this single command line: 1 # renames chromosomes in VCF (based on a GRCh38 graph) 2 sed -i 's/GRCh38#0#//' decomposed_and_filtered.vcf Further details on how to run PanGenie can be found on the GitHub page of the tool, where there are also specifications on how to carry out the biallelic conversion with an ad-hoc Python script.To be noted, PanGenie output will be stored in a folder at the location path/to/output/directory/genotypes; this is the VCF file benefiting from biallelic conversion before evaluation.

Variants evaluation
At last, variants evaluation is done running truvari like so: where the option for including a BED file can be toggled on or off whether or not the analysis conducted was using it; other parameters were left the same as [6] to attain the best performance when benchmarking against pangenome graphs.

Table 1 :
Progression of F1 scores in various stages of parameter search.For diploid sampling, the number of haplotypes indicates the number of candidates.

Table 3 :
Properties of the different graphs.Note that there are 12 single-node connected components in the v1.1 graphs, corresponding to small unplaced contigs, which are not counted as "top-level" chains in the vg software.

Table 4 :
Runtime comparisons for both PanGenie and vg call on the v1.1 diploid graph.